graph : use F32 accumulators for gpt-oss #15312


Draft: wants to merge 1 commit into master
Conversation

@ggerganov ggerganov (Member) commented on Aug 14, 2025

ref #15274

Request F32 accumulators for the attention output multiplication. This is similar to the existing hint for GLM models in llm_graph_context::build_attn():

llama.cpp/src/llama-graph.cpp (lines 1474 to 1482 in 810b9fc):

if (wo) {
    cur = build_lora_mm(wo, cur);
    if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE) {
        // GLM4 and GLM4_MOE seem to have numerical issues with half-precision accumulators
        ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
    }
}
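
For context, ggml_mul_mat_set_prec() is the ggml hint that requests a higher-precision accumulator for a matrix-multiplication result. Its declaration in ggml.h is along these lines (paraphrased here; see the header for the exact form):

GGML_API void ggml_mul_mat_set_prec(
        struct ggml_tensor * a,     // result tensor of a ggml_mul_mat() call
        enum   ggml_prec     prec); // e.g. GGML_PREC_F32 to request F32 accumulation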

Here we add the same hint for the new llm_graph_context::build_attn_with_sinks():

llama.cpp/src/llama-graph.h (lines 734 to 738 in 810b9fc):

// TODO: temporary to keep the diff small. after the code is public will refactor to simplify this
ggml_tensor * build_attn_with_sinks(
        llm_graph_input_attn_kv_unified_iswa * inp,
        ggml_tensor * wo,
This build_attn_with_sinks() path exists only temporarily and will eventually be merged into llm_graph_context::build_attn().
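
The diff itself is not shown above. As a minimal sketch, assuming the hint is applied unconditionally in this path (only gpt-oss uses build_attn_with_sinks() at the moment) and that the output projection goes through build_lora_mm() as in build_attn(), the change would look roughly like this:

if (wo) {
    cur = build_lora_mm(wo, cur);
    // gpt-oss appears to have numerical issues with half-precision accumulators,
    // so request F32 accumulation for the attention output multiplication
    ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
}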
